Abstract

Consider the two related problems of sensor selection and sensor fusion. In
the first, given a set of sensors, one wishes to identify a subset of the
sensors which, while small in size, captures the essence of the data gathered
by the sensors. In the second, one wishes to construct a fused sensor, which
utilizes the data from the sensors (possibly after discarding dependent ones)
in order to create a single sensor which is more reliable than each of the
individual ones.

In this work, we rigorously define the dependence among sensors in terms
of joint empirical measures and incremental parsing. We show that these
measures adhere to a polymatroid structure, which in turn facilitates the
application of efficient algorithms for sensor selection. We suggest both a
random and a greedy algorithm for sensor selection. Given an independent
set, we then turn to the fusion problem, and suggest a novel variant of the
exponential weighting algorithm. In the suggested algorithm, one competes
against an augmented set of sensors, which allows it to converge to the best
fused sensor in a family of sensors, without having any prior data on the
sensors' performance.

1. Introduction

Sensor networks are used to gather and analyze data in a variety of ap- 
plications. In this model, numerous sensors are either spread in a wide area, 
or simply measure different aspects of a certain phenomenon. The goal of 
a central processor which gathers the data is, in general, to infer about the 
environment the sensors measure and make various decisions. An example 
to be kept in mind can be a set of sensors monitoring various networking 
aspects in an organization (incoming and outgoing traffic, addresses, remote 
procedure calls, http requests to servers and such). In many such cases, an 
anomalous behavior detected by a single sensor may not be reliable enough to 
announce the system is under attack. Moreover, different sensors might have 
correlated data, as they measure related phenomena. Hence, the central
processor faces two problems. First, how to identify the set of sensors which 
sense independent data, and discard the rest, which only clutter the decision 
process. The second, how to intelligently combine the data from the sensors 
it selected in order to decide whether to raise an alarm or not. 

In this work, we target both problems. First, we consider the problem 
of sensor selection. Clearly, as data aggregated by different sensors may be 
highly dependent, due to, for example, co-location or other similarities in 
the environment, it is desirable to identify the largest set of independent (or 
nearly independent) sensors. This way, sensor fusion algorithms can be much 
more efficient. For example, in the fusion algorithm we present, identifying 
the set of independent sensors allows us to create families of fused sensors 
based on fewer sensors, hence having a significantly smaller parameter space. 
Moreover, identifying independent sensors is of benefit also to various
control methods, where a few representative independent inputs facilitate
easier analysis. Note that the sensor selection problem is different from the
data compression problem, where the dependence among the data sets is reduced
via some kind of Slepian-Wolf coding [1]. Herein, we do not wish all data to
be reconstructed at the center, but focus only on identifying good sets of
independent sensors, such that their data can be analyzed, disregarding other
sensors. In other words, we do not wish to replace Slepian-Wolf coding by
sending data of independent sensors, only to identify the independent subsets.
For example, the randomized algorithm we suggest gathers data only from small
subsets of the sensors, yet is assured to identify independent sets with high
probability. In a similar manner, a greedy algorithm we suggest can identify
a subset of sensors with relatively high independence among them (compared to
other subsets), even in cases where we do not wish to identify a subset
containing all the information.

Given two data sets, a favored method to measure their dependence is
through various mutual information estimates. Such estimates arise from
calculating marginal and joint empirical entropies, or from the more
efficient method of incremental (Lempel-Ziv) parsing [2]. Indeed, LZ parsing
was used, for example, for multidimensional data analysis [3], neural
computation [4] and numerous other applications. However, despite the ability
of the parsing rule to approximate the true entropy of the source and hence,
as one possible consequence, to identify dependencies in the data, the
applications reported in the current literature were ad hoc, using the
resulting measures mainly to compare pairs of sources.

To date, there is no rigorous method to analyze independence among large 
sets, and handle cases where one sensor's data may depend on measurements 
from many others, including various delays. In this work, we give the
mathematical framework which enables us both to rigorously define the problem
of identifying sets of independent sources in a large set of sensors and to
give highly efficient approximation algorithms based on the observations we
gain.

Still, when no single sensor is reliable enough to give an accurate estimate
of the phenomenon it measures, sensor fusion is used [5], [6], [7]. In the
second part of this work, we consider the problem of sensor fusion. In this
case, for a
given set of sensors, one wishes to generate a new sensor, whose performance 
over time is (under some measure) better than any single sensor in the set. 
Clearly, in most cases, choosing the best-performing sensor in the set might 
not be enough. We wish, in general, to create a sensor whose performance 
is strictly better than any given sensor in the original set, by utilizing data 
from several sensors simultaneously and intelligently combining it. 

1.1. Contribution. 

Our main contributions are the following. First, we show how to harness
the wide variety of algorithms for identifying largest independent sets in
matroids, or the very related problems of minimum cycle bases and spanning
trees in graphs, to our problem of identifying sets of independent sensors in
a sensor network. Our approach is based on highly efficient (linear time in
the size of the data) methods to estimate the dependence among the sensors,
such as the Lempel-Ziv parsing rule [2]. The key step, however, is in showing
how these estimates can either yield a polymatroid, or at least approximate
one, thus facilitating the use of polynomial time algorithms to identify the
independent sets, such as [8], [9]. We construct both a random and a greedy
selection algorithm, and analyze their performance.

We then turn to the problem of (non-correlated) sensor fusion. In
particular, we describe an online fusion algorithm based on exponential
weighting [10]. While weighted majority algorithms were used in the context
of sensor networks [11], [12], [13], in these works the exponential weighting
was used only to identify good sensors and order them by performance. Hence,
applied directly, this algorithm does not yield a good fused sensor. In this
part
of our work, we suggest a novel extension by creating parametric families of 
synthesised sensors. This way, we are able to span a huge set of fused sensors, 
and choose online the best fused sensor. That is, given a set of sensors S, 
this algorithm constructs synthesised sensors, from which it selects a sensor 
whose performance converges to that of the best sensor in both S and the 
constructed parametric family of synthesised sensors. In other words, this 
algorithm results in online sensor fusion. 

We rigorously quantify the regret of the suggested algorithm compared 
to the best fused sensor. In this way, a designer of a sensor fusion algorithm 
has a well-quantified trade-off: choosing a large number of parameters, thus 
covering more families of fusion possibilities, at the price of higher regret. 

2. Preliminaries 

The basic setting we consider is the following. A set of sensors,
$S = \{S_1, \ldots, S_K\}$, is measuring a set of values in a certain
environment. Each sensor may depend on a different set of values, and may base
its decision on these values in a different way. However, each sensor, at each
time instance, estimates whether a target exists in the environment or not.
Thus, the input to sensor $S_j$ at time $t$ is some vector of measurements
$V_t^j$, based on which it will output a value $p_t^j \in [0, 1]$, which is
its estimate for the probability that a target exists at time $t$.$^1$
Throughout, capital letters denote random variables while lower case denotes
realizations. Hence, $P_t^j$ denotes the possibly random output of sensor
$S_j$, $j = 1, 2, \ldots, K$, at time $t$, $t = 1, 2, \ldots, n$.

Let $\{x_t\}_{t=1}^n$ be the binary sequence indicating whether a target
actually appeared at time $t$ or not. The normalized cumulative loss of the
sensor $S_j$ over $n$ time instances is defined as

\[ L_{S_j}(x_1^n) = \frac{1}{n} \sum_{t=1}^{n} d(p_t^j, x_t), \tag{1} \]

for some distance function $d : [0, 1] \times \{0, 1\} \to \mathbb{R}$. If the
sensor's output is binary (a sensor either decides a target exists or not),
then $p_t^j \in \{0, 1\}$ and a reasonable distance measure is the Hamming
distance, that is,

\[ d_H(a, b) = \begin{cases} 0 & \text{if } a = b, \\ 1 & \text{if } a \neq b. \end{cases} \]

If the sensor's output is in $[0, 1]$, then we think of it as the sensor's
estimate for the probability a target exists, and a reasonable $d$ is the
log-loss,

\[ d(p, x) = -x \log(p) - (1 - x) \log(1 - p). \]

In any case, the goal of a good sensor $S$ is to minimize the normalized
cumulative loss $L_S(x_1^n)$, as given by (1). Roughly speaking, in the first
part of this paper, we wish to use the estimates $\{P_t^j\}$ to identify
dependencies among the sensors, and in the second part we wish to construct
fused sensors, whose cumulative loss is lower than that of any given sensor.
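
As a concrete illustration of (1), the following minimal Python sketch
computes the normalized cumulative loss of a single sensor under either of
the two distance functions. The function names and the clipping constant
`eps` are assumptions of this sketch and are not part of the model above.

```python
import math

def hamming_loss(p, x):
    """Hamming distance between a binary decision p and the truth x."""
    return 0.0 if p == x else 1.0

def log_loss(p, x, eps=1e-12):
    """Log-loss d(p, x) = -x log p - (1 - x) log(1 - p), in nats.

    Clipping p away from 0 and 1 is an assumption of this sketch, made
    only to avoid log(0); it is not part of the definition in the text.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -x * math.log(p) - (1 - x) * math.log(1.0 - p)

def normalized_cumulative_loss(outputs, targets, d):
    """(1/n) * sum_t d(p_t, x_t), as in equation (1)."""
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(d(p, x) for p, x in zip(outputs, targets)) / len(outputs)
```

For example, a binary sensor that errs on two out of four time instances has
a normalized Hamming loss of one half, and a sensor that always outputs 0.5
has a log-loss of log 2 per time instance.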

2.1. Polymatroids, Matroids and Entropic Vectors. 

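Let $\mathcal{K}$ be an index set of size $K$ and $\mathcal{N}$ be the power
set of $\mathcal{K}$. A function $g : \mathcal{N} \to \mathbb{R}$ defines a
polymatroid $(\mathcal{K}, g)$ with a ground set $\mathcal{K}$ and rank
function $g$ if it satisfies the following conditions [14]:

\[ g(\emptyset) = 0, \tag{2} \]
\[ g(I) \le g(J) \quad \text{for } I \subseteq J \subseteq \mathcal{K}, \tag{3} \]
\[ g(I) + g(J) \ge g(I \cup J) + g(I \cap J) \quad \text{for } I, J \subseteq \mathcal{K}. \tag{4} \]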
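
For small ground sets, the three conditions can be checked by brute force.
The sketch below is only an illustration: the dictionary representation of
the set function, the helper names and the numerical tolerance `tol` are
assumptions, and the exhaustive loop is exponential in the size of the ground
set.

```python
from itertools import combinations

def all_subsets(ground):
    """All subsets of the ground set, as frozensets (including the empty set)."""
    items = list(ground)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_polymatroid(g, ground, tol=1e-9):
    """Check normalization, monotonicity and submodularity of g.

    g is a dict mapping every subset of `ground` (as a frozenset) to a float.
    """
    subsets = all_subsets(ground)
    if abs(g[frozenset()]) > tol:                        # g(empty set) = 0
        return False
    for I in subsets:
        for J in subsets:
            if I <= J and g[I] > g[J] + tol:             # monotonicity
                return False
            if g[I] + g[J] + tol < g[I | J] + g[I & J]:  # submodularity
                return False
    return True
```

As a usage example, the function min(|I|, 2) on a ground set of three
elements passes the check, while |I|^2 fails it because it violates
submodularity.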



$^1$ The model of target identification and the requirement
$p_t^j \in [0, 1]$ are mainly for illustration purposes. The algorithms we
describe are suitable for different sets as well, with adequate quantization
or scaling.

For a polymatroid $g$ with ground set $\mathcal{K}$, we represent $g$ by the
vector $(g(I) : I \subseteq \mathcal{K}) \in \mathbb{R}^{2^K - 1}$ defined on
the ordered, non-empty subsets of $\mathcal{K}$. We denote the set of all
polymatroids with a ground set of size $K$ by $\Gamma_K$. Thus
$w \in \Gamma_K$ if and only if $w(I)$ and $w(J)$ satisfy equations (2)-(4)
for all $I, J \subseteq \mathcal{K}$, where $w(I)$ is the value of $w$ at the
entry corresponding to the subset $I$. If, in addition to (2)-(4),
$g(\cdot) \in \mathbb{Z}_+$ and $g(I) \le |I|$, then $(\mathcal{K}, g)$ is
called a matroid.

Now, assume $\mathcal{K}$ is some set of discrete random variables. For any
$A \subseteq \mathcal{K}$, let $H(A)$ denote the joint entropy function. An
entropy vector $w$ is a $(2^K - 1)$-dimensional vector whose entries are the
joint entropies of all non-empty subsets of $\mathcal{K}$. It is well-known
that the entropy function is a polymatroid over this ground set
$\mathcal{K}$. Indeed, (2)-(4) are equivalent to the Shannon information
inequalities [15]. However, there exist points $w \in \Gamma_K$ ($K > 3$) for
which there is no set of $K$ discrete random variables whose joint entropies
equal $w$. Following [16], we denote by $\Gamma_K^*$ the set of all
$w \in \Gamma_K$ for which there exists at least one random vector whose joint
entropies equal $w$. A $w \in \Gamma_K^*$ is called entropic. Finally, denote
by $\bar{\Gamma}_K^*$ the convex closure of $\Gamma_K^*$. Then
$\bar{\Gamma}_K^* = \Gamma_K$ for $K \le 3$ but
$\bar{\Gamma}_K^* \neq \Gamma_K$ for $K > 3$ [16].

3. A Matroid-Based Framework for Identifying Independent Sensors

In this section, we use the incremental parsing rule of Lempel and Ziv [2]
to estimate the joint empirical entropies of the sensors' data. We then show
that when the sensors' data is stationary and ergodic, the vector of joint
empirical entropies can be approximated by some point in the polyhedral cone
$\Gamma_K$. In fact, this point is actually in $\bar{\Gamma}_K^*$. As
asymptotically entropic polymatroids are well approximated by asymptotically
entropic matroids [17, Theorem 5], the point in $\mathbb{R}^{2^K - 1}$ which
corresponds to the joint empirical entropies of the sensors is approximated by
the ranks of some matroid. This enables us to identify independent sets of
sensors, and, in particular, largest independent sets, by identifying the
bases (or circuits) of the matroid. Doing this, even the most complex
dependence structures among sensors, including both dependence between
past/future data and dependence among values at the same time instant, can be
identified. Non-linear dependencies are also captured.

We now show how to approximate an entropy vector (hence, a polymatroid)
for the sensor data. We prove that indeed, for large enough data and
ergodic sources, the approximation error is arbitrarily small. This
polymatroid will be the input from which we will identify the independent
sensors.

We first consider the simplest case, in which one treats the sensors
as having memoryless data. That is, sensors for which each reading (in 
time) is independent of the previous or future readings. Note, however, that 
this model still allows the reading of a sensor to depend on the readings 
of other sensors at that time instant. The dependence might be a simple 
(maybe linear) dependence between two sensors, or a more complex one, 
where one sensor's output is a random function of the outputs of a few 
others. It is important to note that it is inconsequential if the sensors are 
indeed memoryless or not. Using this simplified method, only dependencies 
across a single time instant will be identified. A generalization for time- 
dependent data appears in the next sub-section. 

For the sake of simplicity, assume now all $P_t^j$ are binary. Given a
sequence $\{p_i\}_{i=1}^n$, denote by $N(0 \mid \{p_i\}_{i=1}^n)$ and
$N(1 \mid \{p_i\}_{i=1}^n)$ the number of zeros and ones in
$\{p_i\}_{i=1}^n$, respectively. That is,

\[ N(0 \mid \{p_i\}_{i=1}^n) = \sum_{i=1}^{n} \mathbb{1}_{\{p_i = 0\}}, \]

where $\mathbb{1}_{\{\cdot\}}$ is the indicator function. When the sequence
indices are clear from the context, we will abbreviate this by $N(0 \mid p)$.
Hence,

\[ T_p^n = \left( \frac{1}{n} N(0 \mid p), \frac{1}{n} N(1 \mid p) \right) \]

denotes the type of the sequence $p$, that is, its empirical frequencies [18].

In a similar manner, we define the empirical frequencies of several
sequences together, e.g. pairs. For example,

\[ N(0, 1 \mid p^1, p^2) = \sum_{i=1}^{n} \mathbb{1}_{\{p_i^1 = 0,\; p_i^2 = 1\}}. \]

In this case, the 4-tuple

\[ T^n_{p^1, p^2} = \left( \frac{1}{n} N(0, 0 \mid p^1, p^2), \frac{1}{n} N(0, 1 \mid p^1, p^2), \frac{1}{n} N(1, 0 \mid p^1, p^2), \frac{1}{n} N(1, 1 \mid p^1, p^2) \right) \]

denotes the joint type of $(p^1, p^2)$, hence, it includes the empirical
frequencies of the two sequences together, over their product alphabet
$\{0, 1\} \times \{0, 1\}$. For more than two sequences, we denote by
$T^n_{(p^j)_{j \in S}}$ the joint type of the sequences $(p^j, j \in S)$.

For a probability vector $q = (q_1, \ldots, q_m)$, let $H(q)$ denote its
entropy, that is,

\[ H(q) = -\sum_{i=1}^{m} q_i \log(q_i). \]

Let $w_n$ be the $(2^K - 1)$-dimensional vector whose entries are all the
joint empirical entropies calculated from $\{(p_t^j)_{j \in S}\}_{t=1}^n$,
i.e.,

\[ w_n \triangleq \left( H(T^n_{p^1}), \ldots, H(T^n_{p^K}), H(T^n_{p^1, p^2}), H(T^n_{p^1, p^3}), \ldots, H(T^n_{p^1, \ldots, p^K}) \right). \]
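
The vector of joint empirical entropies can be computed directly from joint
types. The following Python sketch is one possible way to do so; the
function names, the dictionary layout and the use of base-2 logarithms are
assumptions made for illustration. It enumerates all non-empty subsets, so it
is only practical for a small number of sensors.

```python
from collections import Counter
from itertools import combinations
from math import log2

def empirical_entropy(columns):
    """Entropy (in bits) of the joint type of the given sequences.

    `columns` is a list of equal-length sequences; the joint type is the
    empirical frequency of each tuple of simultaneous readings.
    """
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log2(c / n) for c in counts.values())

def empirical_entropy_vector(data):
    """Map each non-empty subset of sensor indices to its joint empirical entropy.

    `data` maps a sensor index to its sequence of readings.
    """
    sensors = sorted(data)
    return {frozenset(sub): empirical_entropy([data[j] for j in sub])
            for r in range(1, len(sensors) + 1)
            for sub in combinations(sensors, r)}
```

As a usage example, take three binary sensors where the third is the XOR of
the first two: with readings `{1: [0, 0, 1, 1], 2: [0, 1, 0, 1], 3: [0, 1, 1, 0]}`
every single sensor has empirical entropy 1 bit, while every pair and the
full triple have 2 bits, so any two sensors already carry all the
information.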

Under these definitions, we have the following. 

Proposition 1. For every realization of the sensors' data,
$w_n \in \Gamma_K^*$.

Proof. We wish to show that the vector of joint empirical entropies, $w_n$, is
entropic for any finite $n$. Hence, $w_n \in \Gamma_K^*$. The important
observation is that empirical measures (as defined herein) are legitimate
probability measures (even if the approximation error compared to the true
measure is large), hence entropies calculated from them give rise to an
entropic polymatroid.

Since for any subset $I$, $T^n_{(p^j)_{j \in I}}$ clearly defines a valid
distribution (all entries are in $[0, 1]$ and they sum up to 1), consistency
is the only property it remains to show. Assume all sensors' outputs belong to
some finite alphabet $\mathcal{A}$. We have

\[
\begin{aligned}
\sum_{a_1 \in \mathcal{A}} \frac{1}{n} N(a_1, \ldots, a_K \mid p^1, \ldots, p^K)
  &= \frac{1}{n} \sum_{a_1 \in \mathcal{A}} \sum_{i=1}^{n} \mathbb{1}_{\{p_i^1 = a_1, \ldots, p_i^K = a_K\}} \\
  &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{p_i^2 = a_2, \ldots, p_i^K = a_K\}} \\
  &= \frac{1}{n} N(a_2, \ldots, a_K \mid p^2, \ldots, p^K),
\end{aligned}
\]

which completes the proof. □

Let $w$ denote the true (memoryless) entropy vector of the sources. That is,

\[ w \triangleq \left( H(P_1^1), \ldots, H(P_1^K), H(P_1^1, P_1^2), H(P_1^1, P_1^3), \ldots, H(P_1^1, \ldots, P_1^K) \right). \]

For stationary and ergodic sources, the following proposition is a direct
application of Birkhoff's ergodic theorem.

Proposition 2. Let $\{(p_t^j)_{j \in S}\}_{t=1}^n$ be drawn from a stationary
and ergodic source $\{(P_t^j)_{j \in S}\}_{t=1}^\infty$ with some probability
measure $Q$. Then, for any subset $S' \subseteq S$, we have
$\lim_{n \to \infty} H(T^n_{(p^j)_{j \in S'}}) = H((P_1^j)_{j \in S'})$
$Q$-a.s. (almost surely). As a result, $\Pr(\lim_{n \to \infty} w_n = w) = 1$.

That is, the entropy calculated from the empirical distribution converges
to the true entropy. Moreover, the vector of empirical entropies converges
almost surely (a.s.) to the true entropy vector, which is, of course, an
entropic polymatroid. To be able to harness the diverse algorithmic literature
on matroids (such as matroid optimization relevant for our independence
analysis application), we mention that by [17, Theorem 5], describing the
cone of asymptotically entropic polymatroids, $\bar{\Gamma}_K^*$, is reduced
to the problem of describing asymptotically entropic matroids.

3.1. Dependence Measures for Sensors with Memory. 

Till now, we considered sensors for which the data for any individual 
sensor is a stationary and ergodic process, yet, through first-order empirical 
entropies, only the dependence along a single time instant was estimated. 
While being very easy to implement (linear in the size of the data), this 
method fails to capture complex dependence structures. For example, con- 
sider a sensor whose current data depends heavily on previous data acquired 
by one or several other sensors. 

To capture dependence in time, we offer the incremental parsing rule [2] 
as a basis for an empirical measure. We show that indeed such a measure will 
converge almost surely to a polymatroid, from which maximal independent 
sets can be approximated. We start with a few definitions. 

Let $\{p_i\}_{i=1}^n$ be some sequence over a finite alphabet of size
$\alpha$. The LZ78 [2] parsing rule is a sequential procedure which parses the
sequence $p$ in a way where a new phrase is created as soon as the still
unparsed part of the string differs from all preceding phrases. For example,
the string

0100011011000001010011 ... 

is parsed as 

0, 1, 00, 01, 10, 11, 000, 001, 010, 011, .... 
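
The parsing rule can be sketched in a few lines of Python. This is a minimal
illustration rather than an optimized implementation: the function name is an
assumption, and storing phrases in a set of tuples is chosen for brevity (a
tree-based implementation would avoid the repeated tuple copies).

```python
def lz78_parse(seq):
    """Incremental (LZ78) parsing: cut a new phrase as soon as the current
    unparsed prefix differs from every phrase seen so far."""
    seen = set()
    phrases = []
    current = ()
    for symbol in seq:
        current += (symbol,)
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ()
    if current:  # trailing, possibly repeated, incomplete phrase
        phrases.append(current)
    return phrases
```

Applied to the string above, `["".join(p) for p in lz78_parse("0100011011000001010011")]`
reproduces the ten phrases listed in the example.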

Let $c(\{p_i\}_{i=1}^n)$ denote the number of distinct phrases whose
concatenation generates $\{p_i\}_{i=1}^n$. Furthermore, let
$\rho_{E(s)}(\{p_i\}_{i=1}^n)$ denote the compression ratio achieved by the
best finite-state encoder with at most $s$ states, and define

\[ \rho(p) = \lim_{s \to \infty} \limsup_{n \to \infty} \rho_{E(s)}(\{p_i\}_{i=1}^n). \]

In a nutshell, the main results of [2] state that, on the one hand,

\[ \rho(p) \ge \limsup_{n \to \infty} \frac{c(\{p_i\}_{i=1}^n) \log c(\{p_i\}_{i=1}^n)}{n \log \alpha}, \]

where $\alpha$ is the alphabet size. On the other hand, for any sequence
$\{p_i\}_{i=1}^n$, there exists a finite-state encoder with a compression
ratio $\rho_E(\{p_i\}_{i=1}^n)$ satisfying

\[ \rho_E(\{p_i\}_{i=1}^n) \le \frac{c(\{p_i\}_{i=1}^n) + 1}{n \log \alpha} \log\left( 2\alpha \left( c(\{p_i\}_{i=1}^n) + 1 \right) \right). \]

Thus

\[ \frac{c(\{p_i\}_{i=1}^n) \log c(\{p_i\}_{i=1}^n)}{n \log \alpha} \]

is an asymptotically attainable lower bound on the compression ratio
$\rho(p)$.
Denote by $H(P)$ the entropy rate of a stationary source $P$, that is,
$\lim_{n \to \infty} \frac{1}{n} H(P_1, \ldots, P_n)$. For $K$ sources
$P^1, \ldots, P^K$, the entropy rate vector $w$ is defined as

\[ w \triangleq \left( H(P^1), \ldots, H(P^K), H(P^1, P^2), H(P^1, P^3), \ldots, H(P^1, \ldots, P^K) \right). \]

Analogously to the memoryless case, herein we also define the joint parsing
rule in the trivial way, that is, parsing any subset of $1 < k \le K$
sequences as a single sequence over the product alphabet. Define the LZ-based
estimated entropy vector as (suppressing the dependence of the entries on $n$)

\[ w_n^{LZ} \triangleq \left( H^{LZ}(p^1), \ldots, H^{LZ}(p^K), H^{LZ}(p^1, p^2), \ldots, H^{LZ}(p^1, \ldots, p^K) \right). \]
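
To illustrate joint parsing over the product alphabet, the sketch below
zips several sequences into one sequence of tuples and counts LZ78 phrases.
The normalization $c \log_2 c / n$ (bits per time instant) is an assumption
of this sketch; the exact normalization intended for $H^{LZ}$ may differ by a
constant factor such as $\log \alpha$. The function names are hypothetical.

```python
from math import log2

def lz78_phrase_count(seq):
    """Number of phrases c in the incremental (LZ78) parsing of seq."""
    seen, current, count = set(), (), 0
    for symbol in seq:
        current += (symbol,)
        if current not in seen:
            seen.add(current)
            count += 1
            current = ()
    return count + (1 if current else 0)

def lz_entropy_estimate(*sequences):
    """LZ-based entropy estimate, in bits per time instant.

    Several sequences are parsed jointly as one sequence over the product
    alphabet. The normalization c*log2(c)/n is an assumption of this sketch.
    """
    joint = list(zip(*sequences))
    n = len(joint)
    c = lz78_phrase_count(joint)
    return c * log2(c) / n if c > 0 else 0.0
```

Note that parsing a sequence jointly with an exact copy of itself yields the
same phrase count as parsing it alone, so a completely dependent sensor adds
nothing to the estimate.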
The following is the analogue of Proposition 2 for the non-memoryless case.

Proposition 3. Let $\{(p_t^j)_{j \in S}\}_{t=1}^n$ be drawn from a stationary
and ergodic source $\{(P_t^j)_{j \in S}\}_{t=1}^\infty$. Then,
$w \in \bar{\Gamma}_K^*$ and we have

\[ \Pr\left( \lim_{n \to \infty} w_n^{LZ} = w \right) = 1. \]

Proof. We wish to see that $w_n^{LZ}$ converges to $w$, and that indeed
$w \in \bar{\Gamma}_K^*$. It is not hard to show that
$w \in \bar{\Gamma}_K^*$. To see this, remember that
$H(\{P_i^j\}_{i=1}^n, j \in I)$, ranging over all subsets $I \subseteq S$,
forms an entropic polymatroid [15]. Hence
$\frac{1}{n} H(\{P_i^j\}_{i=1}^n, j \in I)$ forms an asymptotically entropic
polymatroid (as the closure of the entropic region is convex), hence
$w \in \bar{\Gamma}_K^*$.

In [2], it was proved that for stationary and ergodic sources each entry in
$w_n^{LZ}$ converges to the true entropy rate. That is,

\[ \lim_{n \to \infty} H^{LZ}(p^j, j \in I) = \lim_{n \to \infty} \frac{1}{n} H(\{P_i^j\}_{i=1}^n, j \in I) \quad \text{a.s.} \]

Since we have only finitely many entries in $w_n^{LZ}$, a simple union bound
gives Proposition 3. □

Note, however, that the analogue of Proposition 1 is not true in this case.
That is, for finite $n$, $w_n^{LZ}$ might not satisfy the polymatroid axioms
at all. Nevertheless, by Proposition 3, for large enough $n$, $w_n^{LZ}$ is
sufficiently close to $\bar{\Gamma}_K^*$. A fortiori, it is sufficiently close
to $\Gamma_K$. Moreover, for ergodic sources with finite memory, namely,
sources for which

\[ \Pr(P_n = a_n \mid P_{n-1} = a_{n-1}, P_{n-2} = a_{n-2}, \ldots) = \Pr(P_n = a_n \mid P_{n-1} = a_{n-1}, \ldots, P_{n-m} = a_{n-m}) \]

for some finite $m$, there exist a few strong tail bounds on the probability
that the LZ compression ratio exceeds a certain threshold. For example, if
$\|w\|_\infty$ denotes the maximal entry in $w$, we have the following
proposition.

Proposition 4. Let {{Pt)jes}t=i be drawn from a stationary and ergodic 
Markov source {{Pt)jes}t^i- Then, with probability at least 1 — 0( 2 \J^ ), 
llw^ wll < R ( pj > jeS ) 

ll W n — W ll0 S 



log n 

Proof. By [19, Corollary 2], we have



Remembering that $H(P^i, i \in I) \le H(P^i, i \in S)$ for any
$I \subseteq S$ and using the union bound on all entries of $w_n^{LZ}$
results in



Pr (yjiw™-^ ie /)i>^M}) 
<p r (u{iw^)-^>^)i>^^ 




which completes the proof. □ 

The usefulness of Proposition 4 is twofold. First, it gives a practical
bound on the approximation the vector $w_n^{LZ}$ gives to $w$. Second, assume
$w$ is a matroid. This is the case, for example, when bits in the sensors'
data are either independent or completely dependent (in fact, in this case
$w$ is a linearly representable binary matroid). Since $w_n^{LZ}$ might not
satisfy the polymatroid axioms at all, using Proposition 4 one can then easily
check when the entries of $w_n^{LZ}$ can be rounded to the nearest integer in
order to achieve $w$ exactly.

Remark 1. We mention that a different approach to handling sensors with
memory is to calculate high-order empirical entropies, that is, entropies
calculated from the frequency count of the data seen by a sliding window of a
fixed length $l$. With this approach, the achieved vector is entropic (hence a
polymatroid) for any finite $n$. Moreover, with a good tail bound such as [20]
for irreducible Markov chains over a finite alphabet, we are able to show fast
convergence to the true values. The complexity, however, grows exponentially
with $l$. Thus, approaching entropy rates in order to capture long-time
dependencies is of exponential complexity. In the LZ method we suggest, while
the alphabet size indeed grows exponentially, the complexity is a function of
the logarithm of the alphabet size.
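
The sliding-window alternative of Remark 1 can be sketched as follows. The
function name, the base-2 logarithm and the division by the window length
(to obtain a per-symbol figure) are assumptions of this illustration. The
number of distinct windows, and hence the size of the frequency table, can
grow exponentially with the window length.

```python
from collections import Counter
from math import log2

def sliding_window_entropy(seq, window):
    """Empirical entropy (bits) of the length-`window` blocks of seq,
    counted with a sliding window, divided by the window length."""
    n_blocks = len(seq) - window + 1
    if window < 1 or n_blocks < 1:
        raise ValueError("window must satisfy 1 <= window <= len(seq)")
    counts = Counter(tuple(seq[i:i + window]) for i in range(n_blocks))
    h = -sum((c / n_blocks) * log2(c / n_blocks) for c in counts.values())
    return h / window
```

For the alternating sequence 0, 1, 0, 1, ... a window of length 1 sees a
uniform distribution (1 bit per symbol), whereas a window of length 2 already
reveals the dependence in time and yields just under half a bit per symbol.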



4. Identifying Independent Sets of Sensors. 

When the number of sensors is small, and the complexity of calculating
all entries of $w_n^{LZ}$ is reasonable, one can find a subset with high
enough entropy (strong independence) by simply taking the smallest set of
sensors with high enough $w_n^{LZ}$. However, when the number of sensors is
larger (even a few dozen), this method is prohibitively complex, and more
sophisticated algorithms (and their analysis) are required.

Thus, having set the ground, in this section we utilize optimization
algorithms for submodular functions, and matroids in particular, in order to
find maximal independent sets of sensors efficiently. Herein, we include two
examples: a random selection algorithm, which fits cases where the true data
forms a matroid, for which possibly many subsets of sensors include the
desired data, and a greedy algorithm, which easily fits any dependence
structure (while matroids asymptotically span the entropic cone, an additional
approximation step is required [17]). It is important to note that, unlike the
greedy selection (also used in [21] in the context of maximum a posteriori
estimates), which approximates the optimum value up to a constant factor, the
random selection process we suggest here can guarantee an exact solution.

Algorithm RandomSelection

% Input: A set $S$ of $K$ sensors. A parameter $0 < q < 1$.

% Output: A subset $I \subseteq S$, of expected size $qK$, which with high
probability contains a maximal independent set of $S$ (see conditions in
Corollary 1).

• Include a sensor $j$ in subset $I$ with probability $q$, independently of
the other sensors.
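
A direct rendering of this algorithm in Python is given below as a minimal
sketch. The function name and the optional `rng` argument (used only to make
runs reproducible) are assumptions of the illustration.

```python
import random

def random_selection(sensors, q, rng=None):
    """Keep each sensor independently with probability q (0 < q < 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if rng is None:
        rng = random.Random()
    return [s for s in sensors if rng.random() < q]
```

Note that the selection itself never looks at the data; the dependence
structure enters only through the choice of `q` and through the probability
that the selected subset contains a maximal independent set.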



The randomized algorithm is given in Algorithm RandomSelection. As
simple as it looks, by Proposition 4 and [8, Theorem 5.2], under mild
assumptions on the true distribution of the data, it guarantees that, with
high probability, such a random selection produces a subset of sensors
which is a $q$-fraction of the original, yet if the original contains enough
bases (maximal independent sets), then the subset contains a base as well.
This is summarized in the following corollary.

Corollary 1. Let $\{(p_t^j)_{j \in S}\}_{t=1}^n$ be drawn from a stationary
and ergodic Markov source. Assume that $w$ is a matroid of rank $r$ which
contains $a + 2 + \frac{1}{q} \ln r$
disjoint bases. Then, with probability at least 1 — e~ aq — 0( 2 ^ 1 ), the subset I 
produced by Algorithm RandomSelection contains a maximal independent set 
of sensors. 

At first sight, Algorithm RandomSelection does not depend on any of the
dependence measures discussed in this paper. Yet, its power is drawn from
them: once we have established the estimated entropy vector as the key
variable in determining dependence, we know that this asymptotic matroid
is the one we should analyze for independent sets. According to its features
we should choose the parameters in RandomSelection, and these features will
indeed eventually determine the success probability of RandomSelection.

Algorithm GreedySelection

% Input: A set of $K$ sensors, $\{p_t^1, \ldots, p_t^K\}_{t=1}^n$.

% Output: At each time instant, a set $I$ of sensors.

• Initialization: $I = \emptyset$, $H = 0$.

1. $j^* = \arg\max_j w_n^{LZ}(I \cup \{j\})$

2. if $w_n^{LZ}(I \cup \{j^*\}) > H$

   • $I \leftarrow I \cup \{j^*\}$, then $H \leftarrow w_n^{LZ}(I)$

   • Go to step 1

On the other hand, Algorithm GreedySelection takes a different course of
action, to answer a slightly different question: how to choose a small set
of sensors with a relatively high entropy (hence, independence)? How bad
can one subset of sensors be compared to another of the same size? What
is a good method to choose the better one? The algorithm sequentially
increases the size of the sensor set $I$ until its entropy estimate
$w_n^{LZ}(I)$ does not grow. In a similar manner, one can use empirical
entropies instead. Due to the polymatroid properties we proved in the previous
section, a bound on the performance compared to the optimum can be given.
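
The greedy procedure can be sketched in Python with a pluggable entropy
estimator. The function name, the `estimate` callback (which could be an
LZ-based or an empirical joint entropy) and the stopping threshold `eps` are
assumptions of this illustration; with `eps = 0` the loop stops exactly when
the estimate no longer grows.

```python
def greedy_selection(sensors, estimate, eps=0.0):
    """Greedily grow a sensor set while the entropy estimate keeps growing.

    `estimate` maps a frozenset of sensors to an entropy estimate. The loop
    ends once the best achievable gain is at most `eps`.
    """
    selected = frozenset()
    current = 0.0
    remaining = list(sensors)
    while remaining:
        # Step 1: candidate whose addition maximizes the estimate.
        best = max(remaining, key=lambda j: estimate(selected | {j}))
        value = estimate(selected | {best})
        # Step 2: stop when the estimate does not grow by more than eps.
        if value - current <= eps:
            break
        selected, current = selected | {best}, value
        remaining.remove(best)
    return selected, current
```

As a usage example, with the rank function `lambda I: float(min(len(I), 2))`
on sensors `[1, 2, 3]` the procedure selects two sensors and stops, since the
third adds no entropy. Each step evaluates the estimator once per remaining
sensor, which is where the polynomial dependence on the number of sensors
comes from.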

The LZ parsing rule on an alphabet of size $\alpha$ can be implemented in
$O(n \log \alpha)$ time (using an adequate tree and a binary enumeration of
the alphabet). Hence, the complexity of GreedySelection is $O(nK^3)$. The
performance of greedy schemes for submodular functions was analyzed in [22].
As noted in [21], such algorithms achieve a factor of $1 - \frac{1}{e}$ of
the optimum.

In practice, it might be beneficial to stop the algorithm if the entropy
estimate does not grow above a certain threshold, to avoid steps which may
include only a marginal improvement. In fact, this is exactly where the
polymatroid properties we proved earlier kick in, and we have the following.

Proposition 5. Assume Algorithm GreedySelection is stopped after the first
time $w_n^{LZ}(I)$ was incremented by less than some $\epsilon > 0$. Then, for
stationary and ergodic sources, the difference between the entropy of the
currently selected subset of sensors and the entropy that could have been
reached if the algorithm concluded is upper bounded by $K\epsilon + o(1)$.

Proof. We wish to prove that if at some stage of the algorithm the
improvement was some $\epsilon > 0$, then no further step can improve by more
than $\epsilon$, and hence the total improvement (till completion) is bounded
by about $K\epsilon$. To show this, we use the polymatroid axioms, and the
fact that the LZ parsing rule estimates the entropy up to an additive
estimation error of $o(1)$ (as $n$ increases).

Let $w_n^{LZ}(I)$ be the estimated entropy at step $t$ of the algorithm, and
$w_n^{LZ}(I \cup \{j^*\})$ be the estimated entropy at step $t + 1$. We know
that

\[ w_n^{LZ}(I \cup \{j^*\}) < w_n^{LZ}(I) + \epsilon. \]

Also, for stationary and ergodic sources, with high probability,

\[ \left| H(P^j, j \in S') - w_n^{LZ}(S') \right| = o(1) \]

for any subset $S' \subseteq S$. Assume that at step $t + 2$ the algorithm
added a sensor $j' \neq j^*$ to the set $I \cup \{j^*\}$, such that

\[ w_n^{LZ}(I \cup \{j^*\} \cup \{j'\}) - w_n^{LZ}(I \cup \{j^*\}) > \epsilon. \]

In this case, we have

\[
\begin{aligned}
w_n^{LZ}(I \cup \{j'\}) &\ge H(P^j, j \in I \cup \{j'\}) - o(1) \\
&= H(P^j, j \in I \cup \{j^*\} \cup \{j'\}) - H(P^{j^*} \mid P^j, j \in I \cup \{j'\}) - o(1) \\
&\ge H(P^j, j \in I \cup \{j^*\} \cup \{j'\}) - H(P^{j^*} \mid P^j, j \in I) - o(1) \\
&= H(P^j, j \in I \cup \{j^*\} \cup \{j'\}) - H(P^j, j \in I \cup \{j^*\}) + H(P^j, j \in I) - o(1) \\
&\ge w_n^{LZ}(I \cup \{j^*\} \cup \{j'\}) - w_n^{LZ}(I \cup \{j^*\}) + w_n^{LZ}(I) - o(1) \\
&> \epsilon + w_n^{LZ}(I) - o(1).
\end{aligned}
\]

Hence, selecting $j'$ instead of $j^*$ at step $t + 1$ would have been a
better choice, which contradicts the greedy nature of the algorithm: we
assumed $j^*$ was selected to maximize $w_n^{LZ}(I \cup \{j\})$ over all
possible $j$. As a result, no further step can improve by more than
$\epsilon$, and since there are at most $K$ steps left, the proposition
follows. □

5. A Sensor Fusion Algorithm Via Exponential Weighting

In this section, we present an online algorithm for sensor fusion. In [10],
Vovk considered a general set of experts and introduced the exponential
weighting algorithm. In this algorithm, each expert is assigned a weight
according to its past performance. By decreasing the weight of poorly
performing experts, hence preferring the ones proved to perform well thus far,
one is able to compete with the best expert, having neither any a priori
knowledge on the input sequence nor which expert will perform the best.
This result was further extended in [23], where various aspects of a "weighted
majority" algorithm were discussed. In [24], [25], [26], lower bounds on the
redundancy of any universal algorithm were given, including very general loss
functions. It is important to note that the exponential weighting algorithm
assumes nothing on the set of experts, neither their distribution in the space
of all possible experts nor their structure. Consequently, all the results are
of the "worst case" type. Additional results regarding a randomized algorithm
for expert selection can be found in [27] and [28].
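
To make the weighting mechanism concrete, the following Python sketch
implements a plain exponentially weighted forecaster over a finite set of
experts. It is a generic illustration and not the fusion algorithm of this
paper: the weighted-average forecast, the learning rate `eta` and all
function names are assumptions (a randomized variant would instead pick one
expert at random according to the weights).

```python
import math

def _weights(cumulative_losses, eta):
    """Normalized weights proportional to exp(-eta * cumulative loss)."""
    base = min(cumulative_losses)  # shift for numerical stability
    raw = [math.exp(-eta * (c - base)) for c in cumulative_losses]
    total = sum(raw)
    return [r / total for r in raw]

def exponential_weighting(expert_outputs, targets, loss, eta=1.0):
    """Online exponentially weighted forecaster.

    expert_outputs[t][i] is expert i's output at time t; targets[t] is the
    true value, used only after the forecast for time t has been made.
    Returns the list of forecasts and the weights after the last update.
    """
    n_experts = len(expert_outputs[0])
    cumulative = [0.0] * n_experts
    forecasts = []
    for outputs, x in zip(expert_outputs, targets):
        w = _weights(cumulative, eta)  # depends on past losses only
        forecasts.append(sum(wi * p for wi, p in zip(w, outputs)))
        cumulative = [c + loss(p, x) for c, p in zip(cumulative, outputs)]
    return forecasts, _weights(cumulative, eta)
```

With two experts, one always right and one always wrong, the first forecast
is the uninformed average of their outputs, and the weight of the correct
expert then grows quickly towards one.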

The exponential weighting algorithm was found useful also in the lossy
source coding works of Linder and Lugosi [29], Weissman and Merhav [30],
Gyorgy et al. [31] and the derivation of sequential strategies for loss
functions with memory [32]. A common method in these works is the alternation
of experts only once every block of input symbols, necessary to bear the price
of this change (e.g., transmitting the description of the chosen quantizer
[29]-[31]). A major drawback of all the above algorithms is the need to
compute the performance of each expert at every time instant. In [31], though,
Gyorgy et al. exploit the structure of the experts (as they are all
quantizers) to introduce an algorithm which efficiently computes the
performance (or an approximation of it) of each expert at each stage.

In this work, we propose to use a sequential strategy similar to the one
used for loss functions with memory [32] and scanning of multidimensional
data [33], [34] in order to weight the sensors and identify the best fused
sensor. However, given a set of sensors $\mathcal{S}$, our goal is to
construct a new sensor, $\hat{S}$, whose output depends on the outputs of
the given sensors, yet whose performance is better than that of the best
sensor in the set $\mathcal{S}$. We call $\hat{S}$ a synthesised (fused)
sensor. Clearly, when the true target appearance sequence $x_1^n$ is known
in advance, suggesting such a sensor is trivial. However, we are interested
in an online algorithm, which receives the sensors' outputs at each time
instant $t$, together with their performance in the past (calculated by
having access to $x_{t'}$ for $t' < t$ or estimating it), and computes a
synthesised output. We expect the sequence of synthesised outputs given by
the algorithm at times $t = 1, \ldots, n$ to have a lower cumulative loss
than the best sensor in $\mathcal{S}$, for any possible sequence $x_1^n$
and any set of sensors $\mathcal{S}$.

Towards this goal, we will define a parametric set of synthesised sensors.
Once such a set is constructed, say $\mathcal{S}_\Theta$ for some set of
parameters $\Theta$ (that is, $|\Theta|$ possible new sensors), we will use
the online algorithm to compete with the best sensor in
$\mathcal{S} \cup \mathcal{S}_\Theta$. Clearly, a good choice for
$\mathcal{S}_\Theta$ is one for which, on the one hand,
$|\mathcal{S} \cup \mathcal{S}_\Theta|$ is not too large, yet on the other
hand $\mathcal{S}_\Theta$ includes "enough" good synthesised sensors, so
that the best sensor in $\mathcal{S} \cup \mathcal{S}_\Theta$ will indeed
perform well.

Example 1. A simple example to keep in mind is a case where the sensors
$S_1, \ldots, S_K$ all under-estimate the probability that a target exists
(for example, since each sensor measures a different aspect of the target,
which might not be visible each time the target appears). In this case, a
sensor $\hat{S}$ whose output at time $t$ is
$\max_j \{p_t^j, 1 \le j \le K\}$ will have a much smaller cumulative loss
$L_{\hat{S}}(x_1^n)$ compared to that of any individual sensor,
$L_S(x_1^n)$. As a result, when designing families of synthesised sensors
for such a set of $K$ sensors, one can think of a synthesised family
$\mathcal{S}_m$, which includes, for example, all sensors of the type
$\max\{p_t^{j_1}, \ldots, p_t^{j_m}\}$ for some subset
$\{j_1, \ldots, j_m\} \subseteq \{1, \ldots, K\}$. If the miss-detection
probabilities of the sensors are not all equal, clearly some synthesised
set of $m$ sensors will perform better than the others.
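
As a rough illustration of Example 1 (a sketch under stated assumptions, not the authors' implementation), the snippet below enumerates a family of max-fused sensors over all subsets of a given size and scores each one against a known target sequence. The squared-error loss, the helper names `max_fused_family` and `cumulative_loss`, and the toy data are all hypothetical choices made only for illustration.

```python
from itertools import combinations

def max_fused_family(outputs, m):
    """Return {subset: fused sequence}; the fused output at each time is the
    maximum over the chosen subset. outputs[j][t] is sensor j at time t."""
    k, n = len(outputs), len(outputs[0])
    family = {}
    for subset in combinations(range(k), m):
        family[subset] = [max(outputs[j][t] for j in subset) for t in range(n)]
    return family

def cumulative_loss(seq, truth):
    """Normalized cumulative squared-error loss (an assumed loss function)."""
    return sum((p - x) ** 2 for p, x in zip(seq, truth)) / len(truth)

# Toy data: the target is always present; each sensor misses it at some times.
truth = [1.0, 1.0, 1.0, 1.0]
outputs = [
    [1.0, 0.0, 1.0, 0.0],  # sensor 0
    [0.0, 1.0, 0.0, 1.0],  # sensor 1 (complements sensor 0)
    [0.0, 0.0, 1.0, 1.0],  # sensor 2
]

family = max_fused_family(outputs, m=2)
losses = {s: cumulative_loss(seq, truth) for s, seq in family.items()}
best_subset = min(losses, key=losses.get)
print(best_subset, losses[best_subset])
```

In this toy setting the pair whose misses never coincide recovers the target exactly, while every individual sensor misses it half of the time.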

This example can be easily extended to a case where sensors either under- 
estimate or over-estimate. Following a single sensor will give a non-negligible 
error, while a simple median filter (sensor-wise) on a sufficiently large set of 
sensors might give asymptotically zero error. 
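
The median extension can be sketched in the same spirit. The example assumes, hypothetically, that each sensor reports the true value shifted by a fixed bias (some negative, some positive), and uses `statistics.median` from the Python standard library as the sensor-wise filter. The data and the name `median_fused` are illustrative only.

```python
from statistics import median

def median_fused(outputs):
    """Sensor-wise median: at each time t, take the median over all sensors."""
    n = len(outputs[0])
    return [median(sensor[t] for sensor in outputs) for t in range(n)]

truth = [0.2, 0.8, 0.5, 0.9]
biases = [-0.1, -0.05, 0.0, 0.05, 0.1]  # some under-, some over-estimate
outputs = [[x + b for x in truth] for b in biases]

fused = median_fused(outputs)
err_fused = max(abs(f - x) for f, x in zip(fused, truth))
err_single = max(abs(p - x) for p, x in zip(outputs[0], truth))
print(err_fused, err_single)
```

Here the biases are symmetric around zero, so the median lands on the unbiased reading; with real sensors the benefit depends on how the errors are actually distributed.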

5.1. Exponential Weighting for a Parametric Family of Sensors. 

Recall that for any time instant $t \le n$, $L_{S_j}(x_1^t)$ denotes the
intermediate normalized cumulative loss of sensor $S_j$. Hence,
$t L_{S_j}(x_1^t)$ is simply the unnormalized cumulative loss until (and
including) time instant $t$. For simplicity, we denote this loss by
$L_{j,t}$. Furthermore, note that for each $1 \le j \le K + |\Theta|$,
$L_{j,0} = 0$. At each time instant $t$, the exponential weighting
algorithm assigns each sensor $S_j \in \mathcal{S} \cup \mathcal{S}_\Theta$
a probability $P_t(j \mid \{L_{j,t}\}_{j=1}^{K+|\Theta|})$. That is, it
assumes the cumulative losses of all sensors up to time $t$ are known.
Then, at each time instant $t$, after computing
$P_t(j \mid \{L_{j,t}\}_{j=1}^{K+|\Theta|})$, the algorithm selects a sensor
in $\mathcal{S} \cup \mathcal{S}_\Theta$ according to that distribution.
The selected sensor is used to compute the algorithm output at time
$t + 1$, namely, the algorithm uses the selected sensor as the synthesised
sensor $\hat{S}$ at time $t + 1$. Note that this indeed results in a
synthesised sensor, as even if it turns out that the best sensor at some
time instant is in $\mathcal{S}$, it is not necessarily always the same
sensor, hence the algorithm output will probably not equal any fixed sensor
for all time instants $1 \le t \le n$. The suggested algorithm is
summarized in Algorithm OnlineFusion below.

Algorithm OnlineFusion

% Input: $K + |\Theta|$ sensors, $\mathcal{S} \cup \mathcal{S}_\Theta$; data $x_1^n$, arriving sequentially
% Output: At each time instant, a synthesised sensor
  $\hat{S} \in \mathcal{S} \cup \mathcal{S}_\Theta$, chosen at random, such
  that the excess cumulative loss compared to the best synthesised sensor is
  almost surely asymptotically (in $n$) negligible (see Proposition 6 and
  the discussion which follows).

• Initialization:

  $W = K + |\Theta|$; $\forall\, 1 \le j \le K + |\Theta|$: $L_j = 0$, $P(j \mid \{L_j\}) = \frac{1}{W}$; $\eta = \frac{1}{d_{\max}} \sqrt{\frac{8 \log(K + |\Theta|)}{n}}$

• For each $t = 1, \ldots, n$:

  - Choose $\hat{S}$ according to $P(j \mid \{L_j\})$.

  - For each $j = 1, \ldots, K + |\Theta|$:

    * $L_j \leftarrow L_j + d(p_t^j, x_t)$.

  - For each $j = 1, \ldots, K + |\Theta|$:

    * $P(j \mid \{L_j\}) \leftarrow \frac{e^{-\eta L_j}}{\sum_{i=1}^{K+|\Theta|} e^{-\eta L_i}}$.

The main advantage of this algorithm is that, under mild conditions, the
normalized cumulative loss of the synthesised sensor $\hat{S}$ it produces
approaches that of the best sensor in $\mathcal{S} \cup \mathcal{S}_\Theta$;
hence it converges to the best synthesised sensor in a family of sensors,
without knowing in advance which sensor that might be. By the standard
analysis of exponential weighting, the following proposition holds.

Proposition 6. For any sequence $x_1^n$, any set of sensors $\mathcal{S}$
of size $K$ and any set of synthesised sensors $\mathcal{S}_\Theta$, the
expected performance of Algorithm OnlineFusion satisfies
\[
E[L_{\hat{S}}(x_1^n)] \le \min_{S \in \mathcal{S} \cup \mathcal{S}_\Theta} L_S(x_1^n) + d_{\max} \sqrt{\frac{\log(K + |\Theta|)}{2n}},
\]
where the expectation is over the randomized decisions in the algorithm and
$d_{\max}$ is some upper bound on the instantaneous loss.



For completeness, a proof is given in Appendix A. As a result, as long as
$\log(K + |\Theta|) = o(n)$, the synthesised sensor $\hat{S}$ has a
vanishing redundancy compared to the best sensor in
$\mathcal{S} \cup \mathcal{S}_\Theta$. This gives us enormous freedom in
choosing the parametrized set of sensors $\mathcal{S}_\Theta$, and even
sets whose size grows polynomially with the size of the data are
acceptable.
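
To make the procedure concrete, here is a minimal Python sketch of an exponential-weighting loop in the spirit of Algorithm OnlineFusion. It is an illustration under assumptions rather than the authors' code: the absolute-error loss, the fixed random seed, the function name `online_fusion` and the three toy sensors are all hypothetical, and the minimum cumulative loss is subtracted before exponentiating purely for numerical stability (it does not change the resulting distribution).

```python
import math
import random

def online_fusion(sensor_outputs, truth, d_max=1.0, loss=None, rng=None):
    """sensor_outputs[j][t]: output of sensor j (real or synthesised) at time t.
    Returns (indices of the sensors chosen at each step, final probabilities)."""
    if loss is None:
        loss = lambda p, x: abs(p - x)      # assumed loss, bounded by d_max
    if rng is None:
        rng = random.Random(0)              # fixed seed for reproducibility
    num, n = len(sensor_outputs), len(truth)
    eta = math.sqrt(8.0 * math.log(num) / n) / d_max
    cum_loss = [0.0] * num
    probs = [1.0 / num] * num               # uniform initial distribution
    chosen = []
    for t in range(n):
        # Choose a sensor using only the losses observed before time t.
        chosen.append(rng.choices(range(num), weights=probs, k=1)[0])
        # Update every sensor's cumulative loss with the new sample.
        for j in range(num):
            cum_loss[j] += loss(sensor_outputs[j][t], truth[t])
        # Recompute the exponential weights and normalize.
        base = min(cum_loss)
        weights = [math.exp(-eta * (l - base)) for l in cum_loss]
        total = sum(weights)
        probs = [w / total for w in weights]
    return chosen, probs

# Toy run: sensor 0 is perfect, sensor 1 is always wrong, sensor 2 is constant.
truth = [float(t % 2) for t in range(200)]
outputs = [truth[:], [1.0 - x for x in truth], [0.5] * 200]
chosen, probs = online_fusion(outputs, truth)
print([round(p, 6) for p in probs])
```

In this toy run the weight concentrates on the perfect sensor. In general, the final probabilities play the role of the "reputation" weights the algorithm maintains while running.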

The performance of the exponential weighting algorithm can be summarized
as follows. For any set of stationary sources with probability measure
$Q$, as long as the number of synthesised sensors does not grow
exponentially with the data, we have
\[
\liminf_{n \to \infty} E_Q E L_{\hat{S}}(X_1^n) \le \liminf_{n \to \infty} \min_{S \in \mathcal{S} \cup \mathcal{S}_\Theta} E_Q L_S(X_1^n),
\]
where the inner expectation in the left hand side is due to the possible
randomization in $\hat{S}$. When the algorithm bases its decisions on
independent drawings, we have
\[
\lim_{n \to \infty} L_{\hat{S}}(x_1^n) \le \lim_{n \to \infty} \min_{S \in \mathcal{S} \cup \mathcal{S}_\Theta} L_S(x_1^n)
\]
almost surely (in terms of the randomization in the algorithm). If,
furthermore, the sources are strongly mixing, almost sure convergence in
terms of the sources' distribution is guaranteed as well:
\[
\liminf_{n \to \infty} L_{\hat{S}}(X_1^n) \le \liminf_{n \to \infty} \min_{S \in \mathcal{S} \cup \mathcal{S}_\Theta} L_S(X_1^n), \quad Q\text{-a.s.}
\]

A by-product of the algorithm is the set of weights it maintains while
running. These weights are, in fact, good estimates of the sensors'
reputation. Moreover, such weights can help us make intelligent decisions
for synthesised control and fine-tuning of the sensor selection process,
namely, we are able to clearly see which families of synthesised sensors
perform better, and within a family, which set of parameters should be
described in higher granularity compared to the others (since sensors with
these values perform well). Finally, note that this is a finite horizon
algorithm, since the optimal $\eta$ depends on the size of the data, $n$.
One can easily remove the dependence on the size of the data by working
with exponentially growing blocks of data.
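
One common way to realize the "exponentially growing blocks" idea is the doubling trick: split the time axis into blocks of lengths 1, 2, 4, ..., and tune $\eta$ to the length of each block. The sketch below only computes such a schedule. Whether the weights are reset at each block boundary is not specified in the text, so that choice is left open, and the function name `doubling_schedule` and its parameters are hypothetical.

```python
import math

def doubling_schedule(total_steps, num_sensors, d_max=1.0, first_block=1):
    """Return a list of (start, end, eta) with block lengths first_block * 2**i.
    eta is tuned to the nominal length of each block."""
    blocks = []
    start, length = 0, first_block
    while start < total_steps:
        end = min(start + length, total_steps)   # last block may be truncated
        eta = math.sqrt(8.0 * math.log(num_sensors) / length) / d_max
        blocks.append((start, end, eta))
        start, length = end, 2 * length
    return blocks

schedule = doubling_schedule(total_steps=10, num_sensors=240)
for start, end, eta in schedule:
    print(start, end, round(eta, 4))
```

Since the learning rate in each block depends only on that block's length, the horizon $n$ does not need to be known in advance.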

6. Results on Real and Artificial Data

To validate the proposed methods in practice, simulations were carried 
out on both real and synthetic data. We present here some of the results. 

To demonstrate Algorithm OnlineFusion, we used real sensor data collected
from 54 sensors deployed in the Intel Berkeley Research Lab between
February 28th and April 5th, 2004 (see footnote 2). To avoid overly complex
computations, we used only the first 15 real sensors (corresponding to a
wing in the lab) and artificially created from them 225 fused (synthesised)
sensors. For this basic example, the fused sensors were created by simply
averaging the data of any two real sensors. Yet, the results clearly show
how the best fused sensor outperforms the best real sensor, with very fast
convergence times. Figure 1 demonstrates the convergence of the weight
vectors created by the algorithm. At the start (left column), all weights
are equal. Very quickly, the two best sensors attain a relatively high
weight (approximately 0.5), while the weights of the others decrease
exponentially. Hence, the algorithm identifies the two best sensors very
fast. The two best sensors are indeed synthesised ones, with the real
sensors performing much worse. Note that there was no real data ($x_t$) for
this sample. The true data was artificially created from all 15 sensors
with a more complex function than a simple average (first, artifacts were
removed, then an average was taken). Thus, an average over only two
sensors, yet the best two sensors, outperforms any single one, and handles
the artifacts in the data automatically. Figure 2 depicts the data of two
random real sensors (to avoid cluttering the graph), the artificially
created true data $x_t$ and the best synthesised sensor.

To demonstrate the greedy and random selection algorithms, we used the
same data. Table 1 includes the results. The entropy of the maximal triplet
of sensors can be compared to that of random selections of triplets. Note
that since many sensors are spread over a relatively small area, there are
several triplets which include an amount of information very close to the
maximal (for a triplet). To get a sense of how correlated sensors can be,
the entropy of a minimal triplet (also achieved by a greedy algorithm) is
also depicted.
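
A greedy triplet search of this kind might look as follows. This is a simplified sketch: it uses a first-order empirical joint entropy as the set function, which is an assumption standing in for the measures defined earlier in the paper, and the toy data and the names `joint_entropy` and `greedy_select` are hypothetical.

```python
import math
from collections import Counter

def joint_entropy(columns):
    """First-order empirical joint entropy (in bits) of the given sequences."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def greedy_select(data, size):
    """At each step, add the sensor that maximizes the joint entropy estimate."""
    chosen = []
    for _ in range(size):
        remaining = [j for j in range(len(data)) if j not in chosen]
        best = max(remaining,
                   key=lambda j: joint_entropy([data[i] for i in chosen + [j]]))
        chosen.append(best)
    return chosen

data = [
    [0, 0, 1, 1, 0, 0, 1, 1],  # sensor 0
    [0, 0, 1, 1, 0, 0, 1, 1],  # sensor 1: exact copy of sensor 0
    [0, 1, 0, 1, 0, 1, 0, 1],  # sensor 2
    [0, 0, 0, 0, 1, 1, 1, 1],  # sensor 3
]
selected = greedy_select(data, size=3)
print(selected, joint_entropy([data[j] for j in selected]))
```

The redundant copy is never picked, since adding it does not increase the joint entropy estimate. Replacing `max` with `min` gives the corresponding greedy search for a minimal triplet.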

2 For details, see http://db.csail.mit.edu/labdata/labdata.html.

Figure 1: Weight vectors generated by OnlineFusion for 240 sensors - 15 real sensors and
225 synthesised ones. While at first (left column) all sensors have equal weight, as time
evolves (towards the right) some sensors gain reputation (white color), while others lose
it exponentially fast (dark color). Very soon, the two best sensors are identified.

We also demonstrate the random sensor selection algorithm on artificial
data. To do this, we artificially created randomized data for 5 independent
sensors, and used them to create 5 additional dependent ones, which are
functions of the original sensors. Sensors with even numbers are independent
of each other, while sensors with odd numbers are linearly dependent on the
even-numbered sensors. Note that this is a very simplified model, which is
included here only to demonstrate in practice the number of rounds the
random selection algorithm requires in order to find an independent set.
Furthermore, note that sensors depending on others and additional data
may still be independent of each other, depending on the other sensors in
the group. For example, if $P^1$ and $P^2$ are independent bits (with
entropy 1 each), and $P^3 = P^1 \oplus P^2$, then $P^2$ and $P^3$ are still
independent, with joint entropy 2, while the three are dependent, with
joint entropy 2 as well.
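
The XOR example can be checked numerically by enumerating the four equally likely values of the two independent bits. The helper `joint_entropy` is a hypothetical name for a plain empirical joint entropy in bits.

```python
import math
from collections import Counter
from itertools import product

def joint_entropy(columns):
    """Empirical joint entropy (in bits) of the given sequences."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Each of the four (P1, P2) combinations appears once, i.e. with probability 1/4.
p1, p2 = zip(*product([0, 1], repeat=2))
p3 = tuple(a ^ b for a, b in zip(p1, p2))   # P3 = P1 XOR P2

h_3 = joint_entropy([p3])
h_23 = joint_entropy([p2, p3])
h_123 = joint_entropy([p1, p2, p3])
print(h_3, h_23, h_123)
```

The pair and the triple have the same joint entropy, so adding the third sensor to any two of them contributes no new information.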

The algorithm then chose sets of 5 sensors at random. Entropy estimates
of the 5 selected sensors are computed according to the joint first order
probability estimate, that is, $H(p^{j_1}, \ldots, p^{j_5})$, where
$p^{j_1}, \ldots, p^{j_5}$ is the data for the five selected sensors. It is
easy to see from Table 2 that 5 independent sensors were drawn very fast,
with 4 out of 20 trials succeeding.
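
For intuition about how often a random draw can succeed, the following sketch enumerates every 5-subset in a simplified stand-in for this experiment. It assumes, hypothetically, that each dependent sensor is an invertible function (here, the complement) of exactly one independent binary sensor. The actual dependence used in the experiment is not specified beyond being linear, so the resulting fraction is only indicative.

```python
import math
from collections import Counter
from itertools import combinations, product

def joint_entropy(columns):
    """Empirical joint entropy (in bits) of the given sequences."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rows = list(product([0, 1], repeat=5))            # all 32 equally likely inputs
sensors = []
for i in range(5):
    col = tuple(r[i] for r in rows)
    sensors.append(col)                           # independent sensor
    sensors.append(tuple(1 - b for b in col))     # assumed dependent sensor

full = [s for s in combinations(range(10), 5)
        if abs(joint_entropy([sensors[j] for j in s]) - 5.0) < 1e-9]
fraction = len(full) / math.comb(10, 5)
print(len(full), math.comb(10, 5), round(fraction, 3))
```

Under these assumptions a draw attains the full entropy exactly when it picks one sensor from each dependent pair, which happens for 32 of the 252 possible subsets.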

Appendix A. Proof of Proposition 6

We follow the analysis of exponential weighting, similar to [32]. A similar
analysis was also used in [33]. In our setting, however, there is no notion
of block size (so one can assume data is processed in blocks of size 1).

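For some $\eta > 0$ define
\[
W_t = \sum_{j=1}^{K+|\Theta|} e^{-\eta L_{j,t}}
\]
and let the probability distribution assigned by the algorithm be
\[
P_t(j \mid \{L_{j,t}\}_{j=1}^{K+|\Theta|}) = \frac{e^{-\eta L_{j,t}}}{W_t}, \quad 1 \le j \le K + |\Theta|. \tag{A.1}
\]
We have
\[
\begin{aligned}
\log \frac{W_n}{W_0} &= \log\Big( \sum_{j=1}^{K+|\Theta|} e^{-\eta L_{j,n}} \Big) - \log(K + |\Theta|) \\
&\ge \log\Big( \max_{1 \le j \le K+|\Theta|} e^{-\eta L_{j,n}} \Big) - \log(K + |\Theta|) \\
&= -\eta \min_{1 \le j \le K+|\Theta|} L_{j,n} - \log(K + |\Theta|) \\
&= -\eta n \min_{S \in \mathcal{S} \cup \mathcal{S}_\Theta} L_S(x_1^n) - \log(K + |\Theta|).
\end{aligned} \tag{A.2}
\]

Figure 2: Real temperature data from the Intel Research Lab, Berkeley.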
Table 1: Entropy estimates for sensors from the Intel Research Lab, Berkeley.

Method         Sensor Numbers   Entropy Estimate
-------------  ---------------  ----------------
Max. triplet   1, 2, 8          4.0732
Random         15, 7, 2         2.9340
Random         2, 4, 13         3.3720
Random         10, 11, 6        3.4966
Random         2, 10, 8         3.5630
Random         5, 7, 1          3.7798
Random         7, 9, 15         3.8290
Random         1, 9, 14         3.8511
Random         2, 9, 4          3.8528
Random         11, 10, 1        3.8570
Random         12, 7, 2         3.8730
Min. triplet   15, 5, 7         2.4758



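Table 2: Entropy estimate results of 20 independent drawings of 5 out of 10 sensors.

Draw Number   Entropy Estimate   Draw Number   Entropy Estimate
-----------   ----------------   -----------   ----------------
1             3.9938             11            3.9899
2             3.9938             12            2.9966
3             3.9938             13            4.9829
4             3.9915             14            3.9938
5             3.9938             15            3.9938
6             2.9970             16            4.9829
7             1.9976             17            4.9829
8             4.9829             18            2.9966
9             3.9895             19            2.9943
10            3.9938             20            3.9938

Moreover,
\[
\begin{aligned}
\log \frac{W_{t+1}}{W_t} &= \log\Big( \sum_{j=1}^{K+|\Theta|} \frac{e^{-\eta L_{j,t}}}{W_t} e^{-\eta d(p_{t+1}^j, x_{t+1})} \Big) \\
&= \log\Big( \sum_{j=1}^{K+|\Theta|} P_t(j \mid \{L_{j,t}\}_{j=1}^{K+|\Theta|}) e^{-\eta d(p_{t+1}^j, x_{t+1})} \Big) \\
&\le -\eta \sum_{j=1}^{K+|\Theta|} P_t(j \mid \{L_{j,t}\}_{j=1}^{K+|\Theta|}) d(p_{t+1}^j, x_{t+1}) + \frac{\eta^2 d_{\max}^2}{8},
\end{aligned} \tag{A.3}
\]
where the last inequality follows from assuming the distance function
$d(\cdot, \cdot)$ is bounded by some $d_{\max}$, hence
$-\eta d(p_{t+1}^j, x_{t+1})$ is in the range $[-\eta d_{\max}, 0]$, and
from the extension to Hoeffding's inequality given in [32], which asserts
that for any random variable $Z$ taking values in a bounded interval of
size $R$ and with mean $\mu$ we have
\[
\log E[e^Z] \le \mu + \frac{R^2}{8}.
\]
Thus,
\[
\begin{aligned}
\log \frac{W_n}{W_0} &= \sum_{t=0}^{n-1} \log \frac{W_{t+1}}{W_t} \\
&\le -\eta \sum_{t=0}^{n-1} \sum_{j=1}^{K+|\Theta|} P_t(j \mid \{L_{j,t}\}_{j=1}^{K+|\Theta|}) d(p_{t+1}^j, x_{t+1}) + \frac{n \eta^2 d_{\max}^2}{8} \\
&= -\eta n E[L_{\hat{S}}(x_1^n)] + \frac{n \eta^2 d_{\max}^2}{8},
\end{aligned} \tag{A.4}
\]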



where the expectation in (A.4) is with respect to the randomized choices
the algorithm takes. Finally, from (A.2) and (A.4), we have, for any
sequence $x_1^n$,
\[
E[L_{\hat{S}}(x_1^n)] \le \min_{S \in \mathcal{S} \cup \mathcal{S}_\Theta} L_S(x_1^n) + \frac{\log(K + |\Theta|)}{n \eta} + \frac{\eta d_{\max}^2}{8}. \tag{A.5}
\]
Since $\eta$ is any positive parameter, we may optimize the right hand side
of (A.5) with respect to $\eta$. The proposition follows by choosing
\[
\eta = \frac{1}{d_{\max}} \sqrt{\frac{8 \log(K + |\Theta|)}{n}}.
\]
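
As a quick numerical sanity check of this last step, the sketch below evaluates the two $\eta$-dependent terms of the bound at the chosen $\eta$, compares the result with the closed-form redundancy $d_{\max} \sqrt{\log(K + |\Theta|) / (2n)}$ of Proposition 6, and confirms that halving or doubling $\eta$ gives a larger value. The sensor count, horizon and loss bound are arbitrary illustrative values, and `redundancy` is a hypothetical helper name.

```python
import math

def redundancy(eta, num_sensors, n, d_max):
    """The eta-dependent part of the bound: log(N)/(n*eta) + eta*d_max^2/8."""
    return math.log(num_sensors) / (n * eta) + eta * d_max ** 2 / 8.0

num_sensors, n, d_max = 240, 10_000, 1.0

eta_star = math.sqrt(8.0 * math.log(num_sensors) / n) / d_max
closed_form = d_max * math.sqrt(math.log(num_sensors) / (2.0 * n))
at_optimum = redundancy(eta_star, num_sensors, n, d_max)

# Moving eta away from eta_star in either direction should increase the bound.
worse = [redundancy(eta_star * f, num_sensors, n, d_max) for f in (0.5, 2.0)]
print(at_optimum, closed_form, worse)
```

At the optimal $\eta$ the two terms are equal, which is why their sum collapses to the single square-root term in the proposition.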